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Abstract 

Finding coarse-grained, low-dimensional descriptions is an important task in the 
analysis of complex, stochastic models of gene regulatory networks. This task 
involves (a) identifying observables that best describe the state of these complex 
systems and (b) characterizing the dynamics of the observables. In a previous 
paper [13], we assumed that good observables were known a priori, and pre- 
sented an equation-free approach to approximate coarse-grained quantities (i.e, 
effective drift and diffusion coefficients) that characterize the long-time behavior 
of the observables. Here we use diffusion maps [9] to extract appropriate observ- 
ables ( "reduction coordinates" ) in an automated fashion; these involve the leading 
eigenvectors of a weighted Laplacian on a graph constructed from network simu- 
lation data. We present lifting and restriction procedures for translating between 
physical variables and these data-based observables. These procedures allow us to 
perform equation-free coarse-grained, computations characterizing the long-term 
dynamics through the design and processing of short bursts of stochastic simula- 
tion initialized at appropriate values of the data-based observables. 
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1 Introduction 



Gene regulatory networks are complex high- dimensional stochastic dynamical systems. 
These systems are subject to large intrinsic fluctuations that arise from the inherent 
random nature of the biochemical reactions that constitute the network. Such fea- 
tures make realistic modeling of genetic networks, based on exact representations of the 
Chemical Master Equation (such as the Gillespie Stochastic Simulation Algorithm, SSA 
[T8] ) computationally expensive. Recently there has been considerable work devoted 
to developing efficient numerical algorithms for accelerating the stochastic simulation 
of gene regulatory networks lU [191 [131 [32] and, more generally, of chemical reaction 
networks. Many of these techniques are based on time-scales separation and classify the 
biochemical reactions as "slow" or "fast" [321 El [211 [121 E] • this paper we combine 
such acceleration methods with recently developed data-mining techniques (in particu- 
lar, diffusion maps [9l[T0l[30]) capable of identifying appropriate coarse-grained variables 
( "observables" , "reduction coordinates") based on simulation data. These observables 
are then used in the context of accelerating stochastic gene regulatory network sim- 
ulations; they guide the design, initialization, and processing of the results of short 
bursts of full-scale SSA computation. These bursts of SSA are used to numerically solve 
the (unavailable in closed form) evolution equations for the observables; such so-called 
equation-free methods [25j for studying stochastic models have been successfully ap- 
plied to complex systems arising in different contexts [20l [23l [39] . In the context of gene 
regulatory networks - but with known observables - equation free modeling has been 
illustrated in [13]; here we extend the approach to the more general class of problems 
where appropriate observables are unknown a priori. 

We describe the state of a gene regulatory network through a vector 



where Xj are the numbers of various protein molecules, RNA molecules and genes in the 
system. The behavior of the gene regulatory network is described by the time evolution 
of the vector X(t). For naturally occurring gene regulatory networks the dimension N 
of the vector X(t) is, in general, moderately large, ranging from tens to hundreds of 
species. However, the temporal evolution of the network over time scales of interest can 
be often usefully described by a much smaller number n of coordinates. For example, 
in [13], we studied various models of a genetic toggle switch with N = 2, N = 4 and 
N = 6 components of the vector X; yet in all cases, the slow dynamics was effectively 
one-dimensional, and a single linear combination of protein concentrations was sufficient 
to describe the system, i.e. n = 1. In this paper we show how, for this genetic network 
system, good coarse variables can be found by data-mining type methods based on the 
diffusion map approach. 

This paper is organized as follows: We begin with a brief description of our model in 
Section [21 Section [3] quickly reviews the equation-free approach for this type of bistable 
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dynamics. Given a low-dimensional set of observables, the main idea is to locally esti- 
mate drift and diffusion coefficients of an unavailable Fokker-Planck equation in these 
observables from short bursts of appropriately initialized full stochastic simulations. In 
Section 14.11 we show how to process the data generated by stochastic simulations to 
obtain data-driven observables through the construction of diffusion maps [301 129] . The 
leading eigenvectors of the weighted graph Laplacian defined on a graph based on simu- 
lation data suggest appropriate "automated" reduction coordinates when these are not 
known a priori. Such observables are then used to perform "variable-free" computa- 
tions. In Section [5] we present lifting and restriction procedures for translating between 
physical system variables and the automated observables. The bursts of stochastic sim- 
ulation required for equation-free numerics are designed (and processed) based on these 
new coordinates. This combined "variable-free equation-free" analysis appears to be 
a promising approach for computing features of the long-time, coarse-grained behavior 
of certain classes of complex stochastic models (in particular, models of gene regula- 
tory networks), as an alternative to long, full SSA simulations. The approach can, in 
principle, also be wrapped around different types of full atomistic/stochastic simulators, 
beyond SSA, and in particular accelerated SSA approaches such as implicit tau-leaping 
[33] and multiscale or nested SSA [61 [T2]. 



2 Model Description 

Our illustrative example is a two gene network in which each protein represses the 
transcription of the other gene (mutual repression). This type of system has been 
engineered in E. coli and is often referred to as a genetic toggle switch [161 122]. The 
advantage of this simple system is that it allows us to test the accuracy of computational 
methods by direct comparison with results from long-time stochastic simulations. More 
details about the model can be found in and in our previous paper [TH]. The 
system contains two genes with operators Oi and O2, two proteins Pi and P2, and the 
corresponding dimers, i.e. = 6 in (11.11) . The production of Pi (P2) depends on the 
chemical state of the upstream operator Oi (O2). If Oi is empty then Pi is produced 
at the rate 71 and if Oi is occupied by a dimer of P2, then protein Pi is produced at 
a rate ei < 71. Similarly, if O2 is empty then P2 is produced at the rate 72 and if O2 
is occupied by a dimer of Pi, then protein P2 is produced at a rate €2 < 72- Note that, 
for simplicity, transcription and translation are described by a single rate constant. The 
biochemical reactions are (compare with [13] ) 



7iOi+eiP2P20i 72 02+e2PiPi02 

^ Pi ^ P2 (2.1) 

(5i <52 



Pi + Pi ^ PiPi P2 + P2 ^ P2P2 (2.2) 

k-i k-2 
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P2P2 + 0i~^ P2P2O1 Pi Pi + 02^=! P1P1O2 (2.3) 

where overbars denote complexes. Equations (12.11) describe production and degradation 
of proteins Pi and P2. Equations (12.21) are dimerization reactions and equations (12.31) 
represent the binding and dissociation of the dimer and DNA. 
The state vector for our system is 

X= [Pi, P2,iy^,iV^, 01,02] (2.4) 

where Pi and P2 are numbers of proteins, PiPi and P2P2 are numbers of dimers and 
Oi G {0, 1} and O2 G {0, 1} are states of operators. Assuming that we have just one 
copy of Gene 1 and one copy of Gene 2 in the system, then the values of Oi and P2P2O1, 
resp. O2 and P1P1O2, are related by the conservation relations, namely 

P2P2O1 = 1 - Oi, resp. P1P1O2 = 1-02. 

By virtue of (12. ip . Oi = 1 means that the first protein is produced with rate 71, while 
Oi = means that it is produced with rate ei < 71 (similarly for the second protein). 

Models such as the one defined by ( 12. ip - ( 12. 3p can be validated experimentally, 
by comparing their predictions with steady-state distributions of protein abundances 
obtained through single cell fiuorescence measurements of intercellular variability in 
protein expression levels. 



3 Brief review of equation-free computations 

Suppose we have a well-stirred mixture of N chemically reacting species; furthermore, 
assume that the evolution of the system can be described in terms oi n < N slow vari- 
ables (observables) . In the following we assume that n = 1, and denote this variable Q. 
The approach carries through for the case of a relatively small number of slow variables 
as well. The variable Q might be the concentration of one of the chemical species or 
some function of these concentrations (e.g. a linear combination of some of them). In 
Section 14.11 we show how variable-free methods can be used to suggest an appropriate 
Q. Let R denote a vector of the remaining (fast, "slaved" system observables) which, 
together with Q provide a basis for the simulation space. Our assumption implies that 
(possibly, after a short initial transient) the evolution of the system can be approxi- 
mately described by the time-dependent probability density function /(g, t) for the slow 
variable Q that evolves according to the following effective Fokker-Planck equation |34j : 

f (?,^) = I [-V{q)f{q.t) + |[I^(g)/(g,t)]) . (3.1) 

If the effective drift V{q) and the effective diffusion coefficient D{q) are explicitly known 
functions of g, then (13.10 can be used to compute interesting long-time properties of the 
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system (e. g., the equilibrium distribution, transition times between metastable states). 
Assuming that (13.11) provides a good approximation [HI El], and motivated by the 
formulas 



D{q) 



V{q) 



lim 

At^O 




< Q{t + At)-q\ Q{t) = q> 
At 

< [Q{t + At) - g]2 I Q{t) = q> 



(3.2) 



(3.3) 



we used in [201 ISSl [271 EH] the results of short 5-function initialized simulation bursts 
to estimate the average drift, V, and diffusion coefficient D. Note that, in our context, 
the limit At ^ in equations (13. 2p and (13. 3p should be interpreted as "At small, but 
not too small" , i.e. the short bursts are short in the time scale of the slow variable, 
yet long in comparison to the characteristic equilibration time of the remaining system 
variables. 

The steady solution of (13. ip is proportional to exp[— /3$(g)], where the effective free 
energy $(g) is defined as 



Consequently, computing the effective free energy and the equilibrium probability dis- 
tribution can be accomplished without the need for long-time stochastic simulations. 
A procedure for computationally estimating V{q) and D{q) is as follows: 

(A) Given Q = q, approximate the conditional density P{r\Q = q) for the fast 
variables R. Details of this preparatory step were given in |13j . 

(B) Use -P(r|Q = q) from step (A) to determine appropriate initial conditions 
for the short simulation bursts and run multiple realizations for time At. Use 
the results of these simulations and the definitions (13. 2p and (13.30 to estimate the 
effective drift V{q) and the effective diffusion coefficient D{q). 

(C) Repeat steps (A) and (B) for sufficiently many values of Q and then compute 
<l>(g) using formula (13. 4p and numerical quadrature. 

Determining the accuracy of these estimates and, in particular, the number of replica 
simulations required for a prescribed accuracy, is the subject of current work. An impor- 
tant feature of this algorithm is that it is trivially parallelizable (different realizations 
of short simulations starting at "the same g" as well as realizations starting at different 
q values can be run independently, on multiple processors). 

A representative selection of equation- free results from our previous paper [13], for a 
stochastic model of a gene regulatory network, is provided in Figure [H In [13] the (good) 
observable Q was assumed to be known a priori. The upper left panel in Figure [1] shows 
a sample time series of Q, clearly indicative of bistability, generated using the stochastic 
model, while the upper right panel shows the effective free energy /3$ computed using 



mq) = - 1 



' V{q') 



dq' + In D{q) + constant. 



(3.4) 



D{q') 
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Figure 1: Summary of equation-free results from [13]. To compute the figures, we used 
model { \2.1\i - { \2. 3^ where equations { \2.^ - { \2. 3^ were assumed to be at quasi- equilibrium, 
for parameter values see caption of Figure 5 in [13J. 

f l3.4l) as the parameter 7 = 71 = 72 is varied. The equation-free steady state distribution 
of Q obtained from this effective free energy is in excellent agreement with histograms 
produced using long-time simulation (lower left panel). Equation-free computation has 
also been used |T3] to compute "stochastic bifurcation diagrams" (an example is shown 
in the bottom right panel of Figure [1]) using an extension of deterministic bifurcation 
computation ^38j. We believe this array of equation- free numerical techniques holds 
promise for the acceleration of computer-assisted analysis of gene-regulatory networks. 
We now extend this analysis to systems where the "good" observables are unknown a 
priori by describing diffusion-map based variable-free methods. 
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4 Variable-free methods 



4.1 Theoretical framework 

To find a good, low (n-) dimensional representation of the full A^-dimensional stochastic 
simulation data, we start by exploring the phase-space of most likely configurations 
of the system through extensive stochastic simulations; these configurations X (or a 
representative sampling of them) at, say M different times are stored for processing. 
From M such recordings we obtain a set of M vectors X*^ X*^^^ in which 
constitute the input to the diffusion map dimensionality reduction approach we will now 
describe. A crucial step for dimensionality reduction is the definition of a meaningful 
local distance measure between configurations. For continuous systems with equal noise 
strengths in all variables, one may use the following pairwise similarity matrix 



Wij = exp 



ix(^) - x( 



a 



(4.1) 



where || ■ || is the standard Euclidean norm in and a is a characteristic scale for 
the exponential kernel which quantifies the "locality" of the neighborhood in which the 
Euclidean distance is considered (dynamically) meaningful |9]. 

For discrete chemical and biological reactions, as well as in other systems where the 
components of the data vectors may be disparate quantities varying over different orders 
of magnitude (possibly including even Boolean variables), the simple Euclidean norm 
in equation (14.11) with a single scaling factor a equal for all components may, of course, 
not be appropriate. In this case, it is reasonable to consider different scalings for the 
different components, using an iV-dimensional weight vector 

a = [ai,a2, . . . ,aAr] (4.2) 

where Oj > 0, for z = 1, . . . , A^, and define a weighted Euclidean norm 

N 

\\X\\l= 5^(a,X,)^ (4.3) 

i=i 

This norm replaces the standard Euclidean norm in equation (14.11) . where we may now 
choose cr = 1, since this scaling can be absorbed into the vector a; thus we replace (14.10 

by _ 

W^i, =exp[-||X«-X(^')||^]. (4.4) 

The elements of the matrix W are all less than^or equal to one. Nearby points have 
Wij close to one, whereas distant points have Wij close to zero. In the diffusion map 
approach, given a G [0,1] (the choice of this parameter value is discussed later), we 
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define the matrix W by 

E^H (E^^m (4.5) 

Next, we define a diagonal M x M normalization matrix D whose values are given by 

M 

Du = Y,W^k (4.6) 

k=l 

Finally we compute the eigenvalues and right eigenvectors of the matrix 

K = D^^W. (4.7) 

In this paper we will mainly work with the parameter a = 0. However, in other appli- 
cations different values of a may be more suitable (see Appendix A ). As discussed in 



[30l [3l [29] , if there exists a spectral gap among the eigenvalues of this matrix, then the 
leading eigenvectors may be used as a basis for a low dimensional representation of the 



data (see Appendix A). To compute these eigenvectors, we can make use of the fact 
that 

K = D-^/2gj3i/2 ^j^g^g g _ D-1/2WD-1/2 (4 8) 

is a symmetric matrix. Hence, K and S are adjoint and they have the same eigenvalues. 
Since S is symmetric, it is diagonalizable with a set of M eigenvalues 

Ao > Ai > . . . > Am-1 (4.9) 

whose eigenvectors U^, j = 1,...,M form an orthonormal basis of R^. The right 
eigenvectors of K are given by 

V^- = B-'^/'^Vj. (4.10) 

Since K is a Markov matrix, all its eigenvalues are smaller than or equal to one in 
absolute value. Moreover, if the parameter a in (14.1 p is large enough (and, thus, the 
norm vector in fl4.4p is "small enough"), all points are (numerically) connected and the 
largest eigenvalue Aq = 1 has multiplicity one with corresponding eigenvector 

Vo = [l,l,...,l]. (4.11) 

We define the n-dimensional representation of A^-dimensional state vectors by the fol- 
lowing diffusion map 



; (4.12) 



that is, the point X*^*-* is mapped to a vector containing the i-th coordinate of each of 
the first n leading eigenvectors of the matrix K. This mapping ^„ : ^ is defined 
only at the M recorded state vectors. We will show later that it can be extended to 
nearby points in the A^-dimensional phase space, without full re-computation of a new 



matrix and its eigenvectors. In [Appendix A| we provide a theoretical justification for 



this method as a dynamically useful dimensionality reduction step. 
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4.2 Computation of data-based observables 



We replaced (14.11) by (14.41) where the weight vector (14.21) needs to be further specified. 
Two natural choices for the values of components of the weight vector a = [oi, 02, . . . , ctv] 
immediately arise. One option is to regard the absolute values of the components of the 
state vector X as of "equal importance", i.e. 

ak = uj, for A; = 1,2, ...,iV, (4.13) 

where u; is a single method parameter; this is identical to the use of a single a in equation 
14.11 namely a = uj~^. 

The above approach uses the Euclidean distance between data vectors as the basis 
for graph Laplacian construction and eigenanalysis. In our case, the components of these 
vectors are concentrations of different species (e.g. integer numbers of protein molecules, 
each with its own range over the data set). Moreover, the data vectors contain integers 
(0 and 1) representing states of Boolean operators. This motivates a second natural 
choice of the weight vector a = [ai, a2, . . . , otv]- We rescale the state vector X to span 
the symmetrical domain (cube) in iV-dimensional space, i.e. 

a, = . for k = l,2,...,N, (4.14) 

max — mm 

i i 

where the maximum and minimum values are computed over alH = 1, . . . , M. Formula 
(I4.14P implies that components of the vector X*^*^ — X'--'^ i, j = 1, . . . , M, satisfy 

X« - Xi'^ E for = 1, . . . ,iV, t,j = 1, . . . ,M. 

The difference between (I4.13P and (14.141) is that the first formula implicitly assumes that 
the fiuctuations in different components of the state vector X are equally important, 
i.e. the absolute values of fluctuations are important. Formula (14.141) on the other 
hand implies that relative changes (compared to the maximal observed change) in each 
component are more representative than the absolute values of the changes. We will see 
below that (I4.13P appears more suitable for our variable-free analysis. 



4.2.1 Comparison of (llTTBb and dHHD 

Using our illustrative gene regulatory network example (12. ip - (12. 3p we now study the 
dependence of the eigenvectors of the matrix K on the weighting vector [ai, 02, . . . , oat]. 
We run the long-time Gillespie based stochastic simulation of (12.11) - (12.31) to obtain a 
representative set of M state vectors using the following dimensionless stochastic rate 
constants 7i = 72 = 1-14, Ei = 82 = 0, 81 = 62 = 7.5 x 10""^,^! = /c2 = 10~^, 
= /c_2 = 10, koi = ko2 = 0.4, fc_oi = fc_o2 = 10. After removing initial transients, 
we started recording the values of the state vector (12. 4p every 2 x 10® SSA time steps. 
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L/J 


^0 


^1 




^3 


0.02 


1.00000 


0.99986 


0.94506 


0.91360 


0.01 


1.00000 


0.99920 


0.77757 


0.71122 


0.005 


1.00000 


0.99279 


0.44352 


0.35515 


0.002 


1.00000 


0.76262 


0.10715 


3.3 xlO-2 


0.001 


1.00000 


0.28346 


1.2 xlO"^ 


1.1 xlO^^ 


0.0005 


1.00000 


7.5 xlO-2 


1.0 xlO-3 


1.5 xlO-^ 



Table 1: Top eigenvalues of matrix K computed using for a = in 

We made 2000 recordings to obtain a data file with M = 2000 state vectors. Next, 
we use these state vectors X*^*-* to compute the M x M matrix K and its eigenvectors. 
We use formula (14.131) to compute W and D by (14.41) . (14.51) and (14. 6p . Then we use 
implicitly restarted Arnoldi methods (ARPACK package [28]) to find the eigenvectors 
corresponding to the highest eigenvalues of the symmetric matrix S given by (14. 8p . 
Finally, we compute the eigenvectors of K = D~^W by (14.101) . 

The formula (14.131) has a single parameter u which is free for us to specify. It is 
easy to check numerically that the larger the "local neighborhood" size selected (that 
is, the smaller the u value) the denser the connections between datapoints in the graph. 
Table [1] shows the highest eigenvalues for different values of u. We already know from 
[T2j that the system is effectively one-dimensional. A good observable for the system is 
known to be Q = Pi — P2, i.e. the difference between the first two coordinates of the 
state vector. However, the protein concentrations Pi or P2 were also found to give good 
equation-free results. 

We plot the "empirical" good observable of each data point i (its Pi component, i.e. 
x[^\ or the difference of its Pi and P2 components, i.e. Q = X2*'' — x|*^) versus the one- 
dimensional representation \E'i(X*^*^) (see (14.121) ) of the point. The results are given in 
Figure [2] for two different values of uj. The fact that the empirical coordinate Q appears 
to effectively be one-to-one with the "automated" coordinate \E'i(X'^*^) for all points in 
the data set confirms that Q is indeed a good coordinate for data representation (the 
figure clearly shows Q as the graph of a function above \E'i(X*^*)), i.e. that the relation 
between Q and ^i(X*-*^) is one-to-one). The Pi vs. \E'i(X*^*'') graph confirms that Pi 
is also a good observable; it also is approximately one-to-one with \E'i(X^*^), yet the 
slightly "fat curve" suggests that Q is a "better" observable. 

The dependence of the variable-free results on the value chosen for u may be rational- 
ized through equation (14.11) . As discussed in Section IT2l our parameter u is analogous 
to an inverse "cutoff length" in the computation of the diffusion map kernel; if it is 
too large, then the graph becomes disconnected. Clearly, it is a model parameter that 
has to be optimized depending on the problem; our results for uj = 0.0005 show a pure 
linear relation between the "empirical" Q and the "automated" \l/i(X^*'*) observables. 
Increasing by a factor of 2 corresponds to raising the elements of the matrix W to the 
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Figure 2: Variable-free results using formula l \4-13^ and uj = 0.002 (left panels) or 
uj = 0.0005 (right panels). We plot ^i(X'^*^) which corresponds to eigenvalue Ai as 
function of Q = P2 — Pi (top panels) and as function of Pi (center panels). We also 
plot \1/2(X'^*)) which corresponds to eigenvalue A2 as function of Q = P2 — Pi (bottom 
panels). 



11 




Figure 3: Variable-free results using formula (4-^4) md u = 2 (left panel) or u = 
0.1 (right panel); datapoints colored according to gene states: black=[0,0], green=[0,l], 
blue=[l,0], and red=[l,l]. We plot \l/i(X*^*'') which corresponds to eigenvalue Ai as 
function of Q = P2 — Pi- 



fourth power. This change in weight factor (followed by the normalization of ( 14. 6p ) leads 
to a different clustering of the data points. Large u implies that Euclidean distances 
are meaningful when small; this results in a "more clustered" data set, where nearby 
data points (e.g. points within one potential well) appear (in diffusion map coordinates) 
relatively closer, while points far away (e.g. points in different potential wells) appear 
(in diffusion map coordinates) relatively more distant. Indeed, in the case of continuous 
variables, in the limit of large u the eigenvectors of the diffusion map converge to the 
eigenfunctions of a corresponding Fokker-Planck diffusion operator. In the case of two 
deep potential wells, this eigenfunction is approximately constant in the two wells with 
a sharp transition between them. This might explain the slightly flat regions at the two 
edges of the apparent curve in the middle panel of Figure [2] for 00 = 0.002; points within 
the same potential well may differ in Q, yet appear more nearby in the "automated" 
observable. We also include a plot of the relation between Q and the component of the 
data in the second eigenvector \E'2(X''*'') for comparison. 

Next we show that the weight vector computed using the formula (14.141) (based 
on the magnitude of relative state variable changes) is unsuitable for our variable-free 
analysis. We use the same set of M = 2000 state vectors X*^*) to compute the M x M 
matrix K and its eigenvectors, using formula fl4.14p to compute W and D by (14. 4p . 
(14.51) and (14. 6p . A single parameter u still remains to be specified in formula (I4.14p . 
We now again compare the "empirical" and "automated" observables of all data points 
(Q = Pi — P2 as a function of \I'i(X(*)), the one-dimensional representation based on the 
first nontrivial eigenvector of the matrix K). The results are given in Figure [3] for two 
different values of u). We see that the data split into four curves. Each curve corresponds 
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to a distinct combination of gene operator states (actually, two of the curves effectively 
coincide). There are exactly four possibilities of gene states taken from the set 

[Oi,02] e {[0,0], [0,1], [1,0] [1,1] 

If we use formula f l4.13p . then the contribution of the distance between gene operator 
states to the data Euclidean distance is negligible compared to the fluctuations of the 
protein numbers. Local distances computed using the scaling in formula ( 14.14p are 
clearly not representative of the similarity of nearby (in this metric) points for the 
system dynamics: there is no one-to-one correspondence between the empirically known 
"good observable" Q and the "automated" \l/i(X'^*)) . Indeed, for the parameter values 
of our simulation, transitions between the and 1 states of the operators are very fast 
("easy"); on the other hand the Euclidean distance of two data points that differ only 
in these states is large when computed through the formula fl4.14p . 

An alternative approach to computing the effective rate in (12. ip can be obtained 
assuming that the reaction (12. 2p is fast and that we have a lot of protein molecules in the 
system. Then the quasi-steady state assumption gives the formula Pi Pi = 2fci/A;_iPf. 
Hence, we can write the number of dimers as a simple function of the number of monomer 
proteins. On the other hand, using the same approximation in equation (12. 3p . we obtain 

^ _k-ol _ 

Equation (I4.15P gives Oi as a real number in the interval [0, 1]. This number is a good 
approximation for computing the effective rate in (12.11) . However, it is not a value of the 
Boolean variable Oi - it is only a probability that the gene "is on" at the given time. 

If, on the other hand, the "on-ofF' operator transitions were slow, then Figure [3] 
would be quite informative: it would suggest that we should augment our observables 
with the Boolean variables Oi, O2, since these are "slow". Because of the Boolean 
nature of the gene operator variables, it is not possible to know a priori how often these 
transitions occur, and, consequently, how to scale the quantized Boolean state distance 
so that it "meaningfully" participates in the Euclidean distance used for diffusion map 
analysis. As our diffusion map computations stand, we do not take into account the 
temporal pToximitj of points - when they have been obtained from the same transient. If 
such information is taken into account, it is conceivable that temporal proximity would 
provide guidance in choosing the components of weight vectors (especially for Boolean 
variables which change in a quantized manner) so that "local" Euclidean distances are 
indeed representative of the dynamical proximity between data points. 



5 Variable-free computations 

We now couple the above automated detection of observables with the equation-free 
computations in [13] in what we will refer to as "variable-free, equation-free" methods. 
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The results in this section are for the model parameter values given in Section 14.2.11 
using the weight vector defined by fl4.13p with oj = 0.0005 and kernel parameter a = 
(the standard, normalized graph Laplacian) in fl4.5p . 

The data plot in terms of the observable Q and the component in the eigenvector 
^i(X») in Fi gure [2] suggested that a single diffusion map coordinate, denoted Qdmap = 
\l'i(X^*'*), is sufficient to characterize the system dynamics. The diffusion map coordinate 
is found by performing the eigencomputations described in Section l^TD using the full state 
vector (A^=6) at each of the M = 2000 recorded SSA datapoints (every 2 x 10^ SSA 
time steps) as input to our numerical routines. 

In our previous paper [13] we described an approach to compute an effective free 
energy potential in terms of the observable Q = P2 — Pi. Variable- free computation 
of the effective free energy is now feasible using a similar approach modified to analyze 
simulation data in terms of the coordinate Qdmap- Figure H] plots the effective potential 
/?$ in terms of the automated reduction coordinate Qdmap- To evaluate the effective drift 
{V) and diffusion (D) coefficients required in the construction of the effective free energy 
(equation (13. 4p ) we choose a value of Qdmap, locate instances when it appears in the sim- 
ulation database, record its subsequent evolution within a fixed time interval, and then 
average over these instances to estimate the rate of change in the mean and the variance. 
This procedure is repeated for a grid of Qdmap values enabling numerical evaluation of 
the integral in equation (13.40 . The result of this analysis is compared in Figure H] with 
the potential obtained by directly constructing the probability distribution f (Qdmap) 
from the time series and employing the relationship l3^{Qdmap) ~ — log [fiQdmap)]- 

Section[52]describes a lifting procedure that allows short bursts of simulation, instead 
of long time simulation, to be used in variable-free estimation of effective drift and 
diffusion coefficients. The central idea of "variable-free equation-free" methods is to 
perform equation-free analysis in terms of diffusion map variables, based on short bursts 
of SSA simulation in the original variables. This strategy requires an efficient means 
of converting between the physical variables of the system and those of its diffusion 
map (a restriction step) and vice versa: lifting from the diffusion map back to physical 
variables. For small sample sizes, eigendecomposition of the symmetric kernel S (defined 
in (14. 8p ) yields the diffusion map variables for each data point; yet, as the number of 
sample datapoints increases, the associated computational costs become prohibitive. 
The Nystrom formula [2],|1] for eigenspace interpolation is a viable alternative to repeated 
matrix eigendecompositions for computing diffusion map coordinates of new datapoints 
generated during the course of a simulation. Eigenvectors and eigenvalues of the kernel 
S are related by SUj = AjUj, or equivalently 




where [/^(X^*)) denotes the component of the j^^ eigenvector associated with state vector 
X(*). Eigenvector components associated with a new state vector X"^*^*" cannot be com- 



14 




-1 1 

^dmap 



Figure 4: Effective free energy /5$ as a function of Qdmap from binning of all datapoints 
using an SSA database of 2^"^ time steps (blue lines) and computed from numerical 
integration of equation l \3.4^ using a 2^^ point subsampling (keeping 1 out of every 8 
points) of this database (red lines). Numerical integration performed using a more severe 
subsampling of the database with 2^^ points produces an effective free energy profile with 
an unacceptable level of noise. 



puted directly from (15.11) because entries of the matrix S are defined only between pairs 
of datapoints in the original dataset. Defining the M x 1 vector W^*^*" of exponentials 
of the negative squares of the distances between the new point and database points by 

Wr"" = exp [- ||X""" - X^'^Wl ] , (5.2) 
and the M x 1 vector W"'^'" by 

(A/ ^ \ / M \ 
allows the generalized kernel vector S"^'" to be defined as follows: 

The entries in S"'^"' quantify the pairwise similarities between the new point X"'^"' and 
database points consistent with the definition of S in (14. 8 p [1]. 



5.1 Restriction from physical to diffusion map variables 

The Nystrom formula ^ is used to find the eigenvector component f/j(X"'^"') associated 
with a new state vector X"'^"' 
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?7j(X"^'") = ^f2 Sr'^UjiX.^^) (5.5) 
i=i 

allowing the eigenvectors of the matrix K (and thereby the diffusion map coordinates) 
associated with X"^"" to be computed using fl4.10p . A full eigendecomposition is typically 
performed first for a representative subset of the (large) number of SSA datapoints and 
the Nystrom formula is then used to perform the restriction operation in (15.51) which 
amounts to interpolation in the diffusion map space. 



5.2 Lifting from diffusion map to physical variables 

The process of lifting (shown schematically in Figure E]) consists of preparing a detailed 
state vector with prescribed diffusion map coordinates Q^^map- '^^^ main step in our 
lifting process is the minimization of a quadratic objective function defined as follows 

06j(Qdmap(X)) = XotAQdmapi^) " QTapf (5-6) 

where Xobj is a weighting parameter that controls the shape of the objective away from 
its minimum at QdmapO^*) = Q^dmap- '^^^ implicit dependence of Qdmap on X makes 
this optimization problem nontrivial. 

We use here, for simplicity, the method of Simulated Annealing (SA) [2Sl EI] to 
solve the optimization problem, and identify a value of the state vector X* with the 
target diffusion map coordinates Q^^J^ap- '^^^ routine [31] employs a "thermahzed" 
downhill simplex method as the generator of changes in configuration. The simplex, 
consisting of + 1 vertices, each corresponding to a trial state vector, tumbles over 
the objective landscape defined by (15. 6p sampling new state vectors as it does so. The 
control parameter of the method is the "annealing temperature" which controls the rate 
of simplex motion. At high temperatures the method behaves like a global optimizer, 
accepting many proposed configurations (even those that take the simplex uphill i.e. in 
the direction of increasing objective function value). At low temperatures a local search 
is executed and only downhill simplex moves are accepted. 

The starting simplex configuration for this A^-parameter minimization may be se- 
lected at random or (more reasonably) by taking those state vectors in the existing 
database with diffusion map coordinates closest to the target Q^^J^ap- important to 
note that the SA optimization scheme requires the Nystrom formula at each iteration to 
compute Qdmapi^^""''') for trial state vectors, and thus evaluate the objective function 
value, which determines whether the configuration will be accepted or not. Once the 
objective has been evaluated at each of the starting vertices, the following steps are 
repeated until a minimum is located: 

(a) Move the simplex to generate a new state vector X*'"*'^'; 

{b) Evaluate the objective function value at the new state vector Obj{QdmapO^^^^"'''))] 
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Figure 5: A schematic of the procedure for lifting from diffusion map coordinate 
QdmapOQ to 6-dimensional state vector X via minimization of quadratic constraint po- 
tential Obj{QdmapPQ)- Target values of diffusion map coordinate are shown at the base 
of the figure, with the potential function to be minimized in each case indicated above 
these targets. For each diffusion map coordinate value shown, 3 consistent state vectors 
(generated by lifting) are indicated at top of figure. 



(c) Decrement the annealing temperature. 

The downhill simplex method prescribes the motion in step (a) making a selection from 
a set of moves according to the local objective "terrain" (set of objective values at 
the vertices encountered). Step (b) requires an evaluation using the Nystrom formula. 
We note here that this lifting strategy prepares state vectors with desired diffusion map 
coordinates using search algorithm "dynamics" . The suitability of this approach relative 
to alternatives that employ physical dynamics (e.g. using constrained evolution of the 
stochastic simulator in the spirit of the SHAKE algorithm in molecular dynamics [36] ) 
is a relevant and interesting question that merits further investigation. 

5.3 Illustrative Numerical Results 

Equipped with restriction and lifting operators between physical and "automated" vari- 
ables, we can now perform all the equation-free tasks of [13] in the diffusion map coor- 
dinate Qdmap i-e. in variable-free mode. 

A procedure for variable- free computational estimation of V{q) and D{q) in (13.4!) is 
as follows: 

[A] At the value Qdmap = Qdmap lift to a consistent state vector using the approach 
described in Section \572[ 
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Figure 6: Drift in the diffusion map coordinates. The shaded horizontal boxes indicate 
the steady state probability distribution for M = 2000. Points from SSA trajectories are 
shown at intervals of 3 x 10^. Initial configurations for these runs are those shown in 
Figurel^ prepared by lifting from Qdmap values of {—1.5, 0, 1.5). Trajectories drift towards 
the most populated regions of the distribution. 

[B] Use the state vector computed in step [A] as an initial condition for a short 
simulation burst and run multiple realizations for time At. Restrict the results 
of these simulations (Section 15. ip and use the definitions ( 13. 2p and ( 13. 3p (with 
Qdmapit) instead of Q(t)) to estimate the effective drift V (qdmap) and the effective 
diffusion coefficient D (qdmap)- 

[C] Repeat steps [A] and [B] for sufficiently many values of Qdmap and then compute 
$(g) using formula (13 ■4p and numerical quadrature. 

We performed lifting for 3 values of the automated reduction coordinate (Qdmap = 
—1.5,0, 1.5), generating several replicas in each case. From Figure [6] it is apparent that 
the selected values of Qdmap are located near the "rims" of the wells of two local minima 
on the effective free energy landscape for this system. The state vectors generated by 
lifting are shown at the top of Figure O Figure E] plots the SSA simulation evolution, 
initialized at these state vectors, in the observable Qdmap- Also shown in Figure E] is the 
steady state distribution in terms of Qdmap obtained from long SSA runs. Estimates 
for drift (V) and diffusion (D) coefficients at Qdmap values of -1.5 and produced by 
sampling the simulation database and using the lifting procedure described in this paper 
are compared in Table [2J It should be possible to reach a better agreement between 
the coefficient estimates based on the long simulation database and those obtained by 
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Table 2: Estimates for drift (V ) and diffusion (D) coefficients (in s~^) at Qdmap values 
of —1.5 and using initial conditions drawn from the simulation database and prepared 
by lifting. 

a lifting procedure if we evolve the actual model dynamics with a constraint on the 
prescribed Qdmap value - possibly through a parabolic constraint potential of the type 
used in umbrella sampling (see also the "run and reset" procedure described in [HI [13]). 
The effective free energy predicted by analyzing the full simulation database in terms 
of Qdmap can be found in Figure HI 

6 Summary and Conclusions 

The knowledge of good observables is vital in our ability to create effective reduced 
models of complex systems, and thus to analyze and even design their behavior at a 
macroscopic/engineering level more efficiently. In this paper we illustrated a connection 
between computational data-mining (in particular, diffusion maps and the resulting 
low-dimensional description of high-dimensional data) with computational multiscale 
methods (in particular, certain equation- free algorithms). Our illustrative example con- 
sisted of a model gene regulatory network known to exhibit bistable (switching) behavior 
in some regime of its parameter space. We also presented examples of lifting and restric- 
tion protocols, that enable the passing of information between detailed state space and 
reduced "diffusion map coordinate" space. These protocols allow us to "intelligently" 
design short bursts of appropriately initialized stochastic simulations with the detailed 
model simulator. Processing the results of these simulations in diffusion coordinate space 
forms the basis for the design of subsequent numerical experiments aimed at elucidating 
long-term system dynamic features (such as equilibrium densities, effective free energy 
surfaces, escape times between different wells, and their parametric dependence). In 
particular, we confirmed that previously, empirically known, observables were indeed 
meaningful coarse-grained coordinates. 

In traditional diffusion map computations, a single scalar (a scaled Euclidean norm) 
forms the basis for the identification of good reduced coordinates (when they exist). An 
important issue that arose in our example, due to the disparate nature, value ranges and 
dynamics of different data vector components, was the selection of appropriate relative 
scaling among data component values. The computational approach we used was based 
on the data ensemble, without any contribution from the dynamical proximity between 
data points collected along the same trajectory. We believe that incorporating such 
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information will be very useful in determining relative scalings among disparate data 
components; finding ways to integrate such information among data ensembles collected 
in different experiments, and possibly with different sampling rates will greatly assist in 
this direction. 

In this work, diffusion map computations were based on data collected from a single 
long transient, that was considered representative of the entire relevant portion of the 
(six-dimensional) phase space. In more realistic problems such long simulations will be 
no longer possible; yet local simulation bursts, observed on locally valid diffusion map 
coordinates can be used to guide the efficient exploration of phase space. Local smooth- 
ness in these coordinates allows us to use them in protocols such as umbrella sampling 
[411 [36] to "differentially locally extend" effective free energy surfaces. For example, 
"reverse coarse" integration described in [T71 [15] provides computational protocols for 
microscopic/stochastic simulators to track backward in time behavior, accelerating es- 
cape from free energy minima and allowing identification of saddle-type coarse-grained 
"transition states". Design of (computational) experiments for obtaining macroscopic 
information is thus complemented by the design of (computational) experiments to 
extend good low- dimensional data representations: both the coarse-grained coordinates 
and the operations we perform on them can be obtained through appropriately designed 
fine scale simulation bursts. 

In this paper the connection between diffusion maps and coarse-grained computation 
operated only in one direction: diffusion map coordinates influenced the subsequent 
design of numerical experiments. An important current research goal is to establish the 
"reverse connection": the on-line extension/modification of diffusion map coordinates 
towards sampling important, unexplored regions of phase space. 
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Appendix A Diffusion Maps 



The following discussion is largely adapted from [3l [9]. We present a criterion for 
dimensionality reduction and show how it leads to the diffusion map method. 

Suppose we have M points X*^*-* G M^, i = 1, . . . , M, and we define the matrix W by 
(14.51) . Given a mapping f : [1, . . . , M] M*^, we define the functional C by the formula 

m = Y,\m-m?w,,. (a.i) 

We see that £(f) is always nonnegative. Moreover, Wij is close to (resp. far from) one 
for vectors X^*^ and X^-'^ which are near (resp. far) from each other. For a dimensionality 
reduction function f to be useful, we must make sure that nearby points X^^^X'^-'^ in 
are mapped to nearby points f (X*^*^), f (X'--'-*) in M". To find such a mapping, one 
can solve the following minimization problem 

arg mm C{i) where F = {f : F^DF = I„, F'^Dl = 0} (A.2) 

where F is the M x n matrix with row vectors f(i), D is the M x M diagonal matrix 
with entries Da = Wij , i = 1, . . . , M, I„ is the n x n identity matrix, 1 is a vector 
of M ones, and a vector of n zeros. The first constraint removes the arbitrary scaling 
factor, while the second constraint ensures that we do not map all M points X^*) to the 
same number. Since (lA.ip can be rewritten as 

M 

m = E W - f = ^^(^"^(0 - W)F) (A.3) 

the solution F is given by the matrix of eigenvectors corresponding to the lowest eigen- 
values of the matrix 

[D - W] = Im - K (A.4) 

or equivalently by the largest eigenvalues of K. By the non-negativity of the functional 
£(f) it follows that the eigenvalues of Ia/ — K are all non-negative, or that all eigen- 
values of K are smaller than or equal to one. The eigenvector corresponding to the 
eigenvalue Aq = 1 is the vector 1. Ordering the remaining eigenvectors in decreasing 
order we see that the n-dimensional representation of A^-dimensional data points, via 
the minimization of (1A.2P is the diffusion map (14.121) . 

We note that our M points and the matrix Wij can be also viewed as the weighted full 
graph with M vertices, where the weight associated with an edge between points i and 
j is equal to W^j. Then the previous analysis can be reformulated in terms of standard 
spectral graph theory [HI [3] . More precisely, it was shown in [29] that this construction 
leads to the classical normalized graph Laplacian for a = in (14. 5p . If a = 1, then the 
construction gives the Laplace- Beltrami operator on the graph. Finally, if the data are 
produced by a stochastic (Langevin) equation, a = 1/2 provides a consistent method to 
approximate the eigenvalues and eigenvectors of the underlying stochastic problem. 
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Appendix B Simple Illustrative Examples 



We include a brief illustration of the application of the diffusion map approach to the 
well known 3-dimensional "Swiss roll" data set [iQl [351 E] (shown in left panel of Figure 
[7j) where datapoints lie along a 2- dimensional manifold. For this dataset X = [x,y,z]] 
to compute the diffusion map we use a = 1, and a = 2 in equation (14.11) . Figure 
U\ (right panel) plots these datapoints in terms of their components in the top two 
significant eigenvectors (^'i(X(*)), ^2(X*^*^)) of the matrix K for this dataset; it shows 
the "unrolled" 2-dimensional manifold detected by the diffusion map algorithm. The 
same result is obtained irrespective of the ordering (or orientation) of the dataset used 
to compute the pairwise similarity matrix. 




Figure 7: Left panel: Swiss roll dataset in M.^. Datapoints lie along a 2-dimensional man- 
ifold. Datapoints are colored by their z- coordinate value (ordering of datapoints passed 
to diffusion map routine is random). Right panel: plot o/ \E'i(X*^*^) (corresponding to 
eigenvalue \i) against \E'2(X*^*^) (corresponding to eigenvalue \i) for points in the dataset 
(same coloring scheme). The diffusion map "unrolls" the 2-dimensional manifold. 

As a second illustration, Figure [8] shows the potential 

4 4 

E{x, y) = — -x^ + 2x'^ + — + Q exp{-2{x - 2f - lOy^) (B.l) 
8 5 

which has two minima connected by two paths. A subsampling of the dataset generated 
by Monte Carlo simulation using this potential is shown in Figure [9] (left panel) with 
the corresponding diffusion map shown in the right panel of the figure. For this dataset 
X = [x,?/]; to compute the diffusion map we use a = 0, and a = 0.5 in equation (14. ip . 
Figure M (right panel) shows that points close to the bottom of the wells are mapped to 
tight clusters in the diffusion map, with a clear distinction between datapoints on each 
of the two transition pathways between the minima. 
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Figure 8: Two-well potential l\B.l\i with two connecting pathways between minima. 




Figure 9: Left panel: Suhsampled dataset generated by Monte Carlo simulation using 
potential defined in equation dB.ll) (datapoints colored by energy according to colorbar). 
Right panel: dataset diffusion map (same coloring scheme) with top eigenvalues indicated 
in inset. 
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